Welcome to the World of Pokemon

library(datasets)

First lets bring in the Pokemon dataset.

Pokemon = read.csv(file = 'data/Pokemon.csv')
head(Pokemon)
##   X.                  Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1  1             Bulbasaur  Grass Poison   318 45     49      49      65
## 2  2               Ivysaur  Grass Poison   405 60     62      63      80
## 3  3              Venusaur  Grass Poison   525 80     82      83     100
## 4  3 VenusaurMega Venusaur  Grass Poison   625 80    100     123     122
## 5  4            Charmander   Fire          309 39     52      43      60
## 6  5            Charmeleon   Fire          405 58     64      58      80
##   Sp..Def Speed Generation Legendary
## 1      65    45          1     False
## 2      80    60          1     False
## 3     100    80          1     False
## 4     120    80          1     False
## 5      50    65          1     False
## 6      65    80          1     False



Reasearch Question 1:

What is the most common type for pokemon?

Since there are secondary typing for Pokemon, we will have to combine the observations from the two variables.


But first lets see what the data shows for Pokemon that only have one type.

# this is the Type.2 for charmander
# a Pokemon that has a singular "Fire" type

Pokemon$Type.2[5]
## [1] ""

The data displays a “” or empty for a Pokemon that has to Type.2. This will muddy our data if we’re looking for the most common type, since technically “” is a valid observation as far as the code is concerned.


These are the tables for Type.1 and Type.2 variables.

table(Pokemon$Type.1)
## 
##      Bug     Dark   Dragon Electric    Fairy Fighting     Fire   Flying 
##       69       31       32       44       17       27       52        4 
##    Ghost    Grass   Ground      Ice   Normal   Poison  Psychic     Rock 
##       32       70       32       24       98       28       57       44 
##    Steel    Water 
##       27      112
table(Pokemon$Type.2)
## 
##               Bug     Dark   Dragon Electric    Fairy Fighting     Fire 
##      386        3       20       18        6       23       26       12 
##   Flying    Ghost    Grass   Ground      Ice   Normal   Poison  Psychic 
##       97       14       25       35       14        4       34       33 
##     Rock    Steel    Water 
##       14       22       14
# how many types are in Type.1
length(table(Pokemon$Type.1)) 
## [1] 18
# how many types are in Type.2
length(table(Pokemon$Type.2))
## [1] 19

As you can see there is an extra type in table(Pokemon$Type.2) that is “” with the number 386 to represent all of the observations/Pokemon with “” for Type.2.


This is the combined table of Type.1 and Type.2, minus the empty “” type Now we need to be careful, because this does not represent every Pokemon. As you can see when we look at the sum of the table, it shows 1214 when we only have 800 observations/Pokemon.

combinedTypeTable = (table(Pokemon$Type.1) + table(Pokemon$Type.2)[-1])


paste("The sum is: ", sum(combinedTypeTable))
## [1] "The sum is:  1214"

Finally lets see which type is the most common for pokemon, either as their Type.1 OR their Type.2.

# finds the max(table), and then finds the index of the max value, then returns both the number and its name, in our case the type

combinedTypeTable[which(combinedTypeTable == max(combinedTypeTable))]
## Water 
##   126

As you can see, the most common typing for a Pokemon is Water.

But as always a visualization will help in contextualizing this piece of information.

combinedTypeTable = sort(combinedTypeTable, decreasing=TRUE)
barplot(combinedTypeTable[1:5], main = "TOP 5 MOST COMMON TYPES IN POKEMON", ylab="Frequency", col=c("lightblue"))



Reasearch Question 2: *ADD A “Total” Option to the Final Graph that Shows Total Amount of that Type

What generation has the greatest number of each “type”(Type.1 and Type.2 combined) of Pokemon


First we need to know how many different generations are present in our data.

table(Pokemon$Generation)
## 
##   1   2   3   4   5   6 
## 166 106 160 121 165  82

There are 6 different generations of Pokemon listed in the data, with a varying number of Pokemon for each generation.

Now we need to parse the data so that we can see how many of each type is in each generation.

# table for the number of each Type.1 type for all entries where the Generation is 1
table(Pokemon$Type.1[which(Pokemon$Generation == 1)])
## 
##      Bug   Dragon Electric    Fairy Fighting     Fire    Ghost    Grass 
##       14        3        9        2        7       14        4       13 
##   Ground      Ice   Normal   Poison  Psychic     Rock    Water 
##        8        2       24       14       11       10       31
# table for the number of each Type.2 type for all entries where the Generation is 1
table(Pokemon$Type.2[which(Pokemon$Generation == 1)])
## 
##              Dark   Dragon    Fairy Fighting   Flying    Grass   Ground 
##       88        1        1        3        2       23        2        6 
##      Ice   Poison  Psychic     Rock    Steel    Water 
##        3       22        7        2        2        4
length(table(Pokemon$Type.1[which(Pokemon$Generation == 1)]))
## [1] 15
length(table(Pokemon$Type.2[which(Pokemon$Generation == 1)]))
## [1] 14

But we run in to a problem due to the way that Pokemon types are set up. A Pokemon has one or two types, and the Type.2 is valued the same as Type.1. This means that we need to count both when looking at how many types of each Pokemon are in each generation. For example the first Pokemon in the data, Bulbasaur, would count as both a “Grass” and “Poison” type, with both holding equal weight. This causes discrepancies like we see above, where some types are missing from the Type.2 table and the Type.1 table . We will have to modify the table ourselves to make this work.


Since we will most likely have to do this for every generation, lets make a function that fills in the missing “types” and returns one vector with the combined values for both Type.1 and Type.2

# this function will help us combine the two Type tables int o one table with all unique types from both Type.1 and Type.2, as well as add up duplicates


combineTypeTblByGen = function(generation){
  
  # merging the table of Type.1 and Type.2 for the generation
  mergedTypeTable = merge(table(Pokemon$Type.1[which(Pokemon$Generation == generation)]), table(Pokemon$Type.2[which(Pokemon$Generation == generation)])[-1], all= TRUE)
  
  
  # vector version of the merged tables above
  typeVec = mergedTypeTable[[2]]
  names(typeVec) = mergedTypeTable[[1]]
  
  
  
  # for loop that adds up the repeat values for the types, and then removes the duplicate name+value from the vector
  len = length(typeVec) - 1
  for(i in 1:len){
  
    if(i >= length(typeVec)){break}
  
    else if(names(typeVec[i]) == names(typeVec[i + 1])){

      typeVec[i+1] = typeVec[i] + typeVec[i+1]
      #print(typeVec[i])

      typeVec = typeVec[-c(i)]
  
    }

  }
  
  return(typeVec)
  
  
}


barplot(combineTypeTblByGen(1), las=2, main="Frequency of Types for Generation 1")

The plot above combines frequency of “types” in both Type1 and Type2 for generation 1, and because we have the function combineTypeTblByGen() we are able to replicate this for every other generation as well.


Since we have the combined tables for every generation, we can move on to comparing the generations by Type.

This next function will allow us to see how many of one specific type is in every generation. For example it will allow us to see the number of “Bug” types there are in generation 1 through 6.

# function takes in a character argument that represents a Pokemon Type and returns a vector of length 6, with each index representing the number of that type of pokemon are in that generation number

compareTypeAcrossGen = function(type){
  
  oneTypeAllGensVec = c()
  
  for(i in 1:6){
    
    #print(combineTypeTblByGen(i)[type])
    
    oneTypeAllGensVec[i] = combineTypeTblByGen(i)[type]
   
    
  }

  names(oneTypeAllGensVec) = c("Gen 1", "Gen 2", "Gen 3", "Gen 4", "Gen 5", "Gen 6")
  
  return(oneTypeAllGensVec)

}


compareTypeAcrossGen("Bug")
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    14    12    14    11    18     3
compareTypeAcrossGen("Dark")
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##     1     8    13     7    16     3
barplot(compareTypeAcrossGen("Bug"), main="Frequency of Bug Types in Each Generation(Gen)", las=2, col = "lightgreen")

We have the comparisons across generations for individual types, but that doesn’t accomplish anything unless we want to create 18 different barplots. Instead lets use the plotly package to create an interactive barplot that shows the frequency of each type across the generations to get a broader understanding of the dataset.

typeByGenTraces[]
## [[1]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    14    12    14    11    18     3 
## 
## [[2]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##     4     2    15     4    12     9 
## 
## [[3]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##     9     9     5    12    12     3 
## 
## [[4]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##     5     8     8     1     3    14 
## 
## [[5]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##     9     2     9    10    17     4 
## 
## [[6]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    14    11     9     6    16     8 
## 
## [[7]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##     4     1     8     9     9    15 
## 
## [[8]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    15    10    18    17    20    15 
## 
## [[9]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    14    11    16    12    12     2 
## 
## [[10]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##     5     5     7     8     9     2 
## 
## [[11]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    24    15    18    18    19     4 
## 
## [[12]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    36     4     5     8     7     2 
## 
## [[13]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    18    10    28    10    16     8 
## 
## [[14]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    12     8    12     7    10     9 
## 
## [[15]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    35    18    31    15    18     9 
## 
## [[16]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##     1     8    13     7    16     3 
## 
## [[17]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    23    19    14    16    21     8 
## 
## [[18]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##     2     3    12    12    12     5
q2FinalGraph

Lets do some Querying

library(DBI)
library(RSQLite)
## Warning: package 'RSQLite' was built under R version 4.4.3
drv = dbDriver("SQLite")

pokemonDatabase = dbConnect(drv, dbname = "./data/veekun-pokedex.sqlite")
dbExecute(pokemonDatabase, "PRAGMA foreign_keys = on")
## [1] 0
dbListTables(pokemonDatabase)
##   [1] "abilities"                          "ability_changelog"                 
##   [3] "ability_changelog_prose"            "ability_flavor_text"               
##   [5] "ability_names"                      "ability_prose"                     
##   [7] "berries"                            "berry_firmness"                    
##   [9] "berry_firmness_names"               "berry_flavors"                     
##  [11] "characteristic_text"                "characteristics"                   
##  [13] "conquest_episode_names"             "conquest_episode_warriors"         
##  [15] "conquest_episodes"                  "conquest_kingdom_names"            
##  [17] "conquest_kingdoms"                  "conquest_max_links"                
##  [19] "conquest_move_data"                 "conquest_move_displacement_prose"  
##  [21] "conquest_move_displacements"        "conquest_move_effect_prose"        
##  [23] "conquest_move_effects"              "conquest_move_range_prose"         
##  [25] "conquest_move_ranges"               "conquest_pokemon_abilities"        
##  [27] "conquest_pokemon_evolution"         "conquest_pokemon_moves"            
##  [29] "conquest_pokemon_stats"             "conquest_stat_names"               
##  [31] "conquest_stats"                     "conquest_transformation_pokemon"   
##  [33] "conquest_transformation_warriors"   "conquest_warrior_archetypes"       
##  [35] "conquest_warrior_names"             "conquest_warrior_rank_stat_map"    
##  [37] "conquest_warrior_ranks"             "conquest_warrior_skill_names"      
##  [39] "conquest_warrior_skills"            "conquest_warrior_specialties"      
##  [41] "conquest_warrior_stat_names"        "conquest_warrior_stats"            
##  [43] "conquest_warrior_transformation"    "conquest_warriors"                 
##  [45] "contest_combos"                     "contest_effect_prose"              
##  [47] "contest_effects"                    "contest_type_names"                
##  [49] "contest_types"                      "egg_group_prose"                   
##  [51] "egg_groups"                         "encounter_condition_prose"         
##  [53] "encounter_condition_value_map"      "encounter_condition_value_prose"   
##  [55] "encounter_condition_values"         "encounter_conditions"              
##  [57] "encounter_method_prose"             "encounter_methods"                 
##  [59] "encounter_slots"                    "encounters"                        
##  [61] "evolution_chains"                   "evolution_trigger_prose"           
##  [63] "evolution_triggers"                 "experience"                        
##  [65] "genders"                            "generation_names"                  
##  [67] "generations"                        "growth_rate_prose"                 
##  [69] "growth_rates"                       "item_categories"                   
##  [71] "item_category_prose"                "item_flag_map"                     
##  [73] "item_flag_prose"                    "item_flags"                        
##  [75] "item_flavor_summaries"              "item_flavor_text"                  
##  [77] "item_fling_effect_prose"            "item_fling_effects"                
##  [79] "item_game_indices"                  "item_names"                        
##  [81] "item_pocket_names"                  "item_pockets"                      
##  [83] "item_prose"                         "items"                             
##  [85] "language_names"                     "languages"                         
##  [87] "location_area_encounter_rates"      "location_area_prose"               
##  [89] "location_areas"                     "location_game_indices"             
##  [91] "location_names"                     "locations"                         
##  [93] "machines"                           "move_battle_style_prose"           
##  [95] "move_battle_styles"                 "move_changelog"                    
##  [97] "move_damage_class_prose"            "move_damage_classes"               
##  [99] "move_effect_changelog"              "move_effect_changelog_prose"       
## [101] "move_effect_prose"                  "move_effects"                      
## [103] "move_flag_map"                      "move_flag_prose"                   
## [105] "move_flags"                         "move_flavor_summaries"             
## [107] "move_flavor_text"                   "move_meta"                         
## [109] "move_meta_ailment_names"            "move_meta_ailments"                
## [111] "move_meta_categories"               "move_meta_category_prose"          
## [113] "move_meta_stat_changes"             "move_names"                        
## [115] "move_target_prose"                  "move_targets"                      
## [117] "moves"                              "nature_battle_style_preferences"   
## [119] "nature_names"                       "nature_pokeathlon_stats"           
## [121] "natures"                            "pal_park"                          
## [123] "pal_park_area_names"                "pal_park_areas"                    
## [125] "pokeathlon_stat_names"              "pokeathlon_stats"                  
## [127] "pokedex_prose"                      "pokedex_version_groups"            
## [129] "pokedexes"                          "pokemon"                           
## [131] "pokemon_abilities"                  "pokemon_color_names"               
## [133] "pokemon_colors"                     "pokemon_dex_numbers"               
## [135] "pokemon_egg_groups"                 "pokemon_evolution"                 
## [137] "pokemon_form_generations"           "pokemon_form_names"                
## [139] "pokemon_form_pokeathlon_stats"      "pokemon_forms"                     
## [141] "pokemon_game_indices"               "pokemon_habitat_names"             
## [143] "pokemon_habitats"                   "pokemon_items"                     
## [145] "pokemon_move_method_prose"          "pokemon_move_methods"              
## [147] "pokemon_moves"                      "pokemon_shape_prose"               
## [149] "pokemon_shapes"                     "pokemon_species"                   
## [151] "pokemon_species_flavor_summaries"   "pokemon_species_flavor_text"       
## [153] "pokemon_species_names"              "pokemon_species_prose"             
## [155] "pokemon_stats"                      "pokemon_types"                     
## [157] "region_names"                       "regions"                           
## [159] "stat_names"                         "stats"                             
## [161] "super_contest_combos"               "super_contest_effect_prose"        
## [163] "super_contest_effects"              "type_efficacy"                     
## [165] "type_game_indices"                  "type_names"                        
## [167] "types"                              "version_group_pokemon_move_methods"
## [169] "version_group_regions"              "version_groups"                    
## [171] "version_names"                      "versions"

This database and tables needs to be “cleaned” or organized into the information that I actually need –> practice organizing and putting data togehter with joins
Questions to ask/analyze:
how has the stats fro certain types changed over the course of the generartions?
is this change significant?

Goal 1: Create a table/df with all relevant info like pokemon speciest, pokedex number, stats, moves, pokedex flavor text, habitat, strengths and weaknesses, enocunters, locations, abilities

Goal 2:

Pokedex app? image, stats, and reads pokedex entry? check normality for stats by type

hypothesis: flying types are significantly faster than ground, check?

Research Question 3: How does each individual stat vary across the different types. NOTE: language_id of 9 is English

When I think of flying type Pokemon, I imagine fast aerial predators, so naturally you would think that their speed stat is significantly higher than that of ground/rock types that are usually large boulders and giants. But do the stats, get it?, actually back this up? Lets Find Out!

Step 1: get a data frame with every flying type and their speed stat and every ground type and their speed stat

Lets build this dataframe step by step.
Step 1a: get a dataframe with every Pokemon and its speed stat

dbGetQuery(pokemonDatabase, 
           "PRAGMA FOREIGN_KEY_LIST(pokemon_stats)")
##   id seq   table       from to on_update on_delete match
## 1  0   0   stats    stat_id id NO ACTION NO ACTION  NONE
## 2  1   0 pokemon pokemon_id id NO ACTION NO ACTION  NONE
dbGetQuery(pokemonDatabase, "
           SELECT pokemon_id, identifier, base_stat
           FROM pokemon_stats INNER JOIN stats
           ON pokemon_stats.stat_id = stats.id
           WHERE identifier = 'speed'
           LIMIT 20")
##    pokemon_id identifier base_stat
## 1           1      speed        45
## 2           2      speed        60
## 3           3      speed        80
## 4           4      speed        65
## 5           5      speed        80
## 6           6      speed       100
## 7           7      speed        43
## 8           8      speed        58
## 9           9      speed        78
## 10         10      speed        45
## 11         11      speed        30
## 12         12      speed        70
## 13         13      speed        50
## 14         14      speed        35
## 15         15      speed        75
## 16         16      speed        56
## 17         17      speed        71
## 18         18      speed       101
## 19         19      speed        72
## 20         20      speed        97

Step 1b: dataframe with every pokemon’s type

# dbGetQuery(pokemonDatabase, "
#            SELECT *
#            FROM pokemon_types")
# 
# dbGetQuery(pokemonDatabase, "
#            PRAGMA FOREIGN_KEY_LIST(pokemon_types)")
# 
# dbGetQuery(pokemonDatabase, "
#            SELECT *
#            FROM types")
dbGetQuery(pokemonDatabase, "
           PRAGMA FOREIGN_KEY_LIST(pokemon_types)
           ")  
##   id seq   table       from to on_update on_delete match
## 1  0   0   types    type_id id NO ACTION NO ACTION  NONE
## 2  1   0 pokemon pokemon_id id NO ACTION NO ACTION  NONE
dbGetQuery(pokemonDatabase, "
           SELECT pokemon_id, slot, identifier as TYPE
           FROM pokemon_types INNER JOIN types
           ON pokemon_types.type_id = types.id
          LIMIT 20")
##    pokemon_id slot   TYPE
## 1           1    1  grass
## 2           1    2 poison
## 3           2    1  grass
## 4           2    2 poison
## 5           3    1  grass
## 6           3    2 poison
## 7           4    1   fire
## 8           5    1   fire
## 9           6    1   fire
## 10          6    2 flying
## 11          7    1  water
## 12          8    1  water
## 13          9    1  water
## 14         10    1    bug
## 15         11    1    bug
## 16         12    1    bug
## 17         12    2 flying
## 18         13    1    bug
## 19         13    2 poison
## 20         14    1    bug

Step 1c: Now that we have two dataframes with every pokemon and its speed stat and another with every Pokemon and its type, we need a dataframe that combines these two.

# initial query with all required columns
dbGetQuery(pokemonDatabase, "
           SELECT pokemon.id, pokemon.identifier, pokemon_types.pokemon_id, pokemon_types.type_id, types.identifier, pokemon_stats.stat_id, pokemon_stats.base_stat, stats.identifier
           
           FROM pokemon_types

           INNER JOIN pokemon ON pokemon_types.pokemon_id = pokemon.id
           INNER JOIN types ON pokemon_types.type_id = types.id
           INNER JOIN pokemon_stats ON pokemon_stats.pokemon_id = pokemon.id
           INNER JOIN stats ON pokemon_stats.stat_id = stats.id
           LIMIT 20
           ")
##    id identifier pokemon_id type_id identifier stat_id base_stat
## 1   1  bulbasaur          1      12      grass       1        45
## 2   1  bulbasaur          1      12      grass       2        49
## 3   1  bulbasaur          1      12      grass       3        49
## 4   1  bulbasaur          1      12      grass       4        65
## 5   1  bulbasaur          1      12      grass       5        65
## 6   1  bulbasaur          1      12      grass       6        45
## 7   1  bulbasaur          1       4     poison       1        45
## 8   1  bulbasaur          1       4     poison       2        49
## 9   1  bulbasaur          1       4     poison       3        49
## 10  1  bulbasaur          1       4     poison       4        65
## 11  1  bulbasaur          1       4     poison       5        65
## 12  1  bulbasaur          1       4     poison       6        45
## 13  2    ivysaur          2      12      grass       1        60
## 14  2    ivysaur          2      12      grass       2        62
## 15  2    ivysaur          2      12      grass       3        63
## 16  2    ivysaur          2      12      grass       4        80
## 17  2    ivysaur          2      12      grass       5        80
## 18  2    ivysaur          2      12      grass       6        60
## 19  2    ivysaur          2       4     poison       1        60
## 20  2    ivysaur          2       4     poison       2        62
##         identifier
## 1               hp
## 2           attack
## 3          defense
## 4   special-attack
## 5  special-defense
## 6            speed
## 7               hp
## 8           attack
## 9          defense
## 10  special-attack
## 11 special-defense
## 12           speed
## 13              hp
## 14          attack
## 15         defense
## 16  special-attack
## 17 special-defense
## 18           speed
## 19              hp
## 20          attack

Save the dataframe so that we can do statistics on it with R

# cleaned up query with all required info for flying and ground type pokemon
flynGrndSPD_df = dbGetQuery(pokemonDatabase, "
           SELECT pokemon.identifier AS name, types.identifier AS type, stats.identifier AS stat_name, pokemon_stats.base_stat
           
           FROM pokemon_types

           INNER JOIN pokemon ON pokemon_types.pokemon_id = pokemon.id
           INNER JOIN types ON pokemon_types.type_id = types.id
           INNER JOIN pokemon_stats ON pokemon_stats.pokemon_id = pokemon.id
           INNER JOIN stats ON pokemon_stats.stat_id = stats.id
            
           WHERE stat_name LIKE 'speed' AND (type LIKE 'flying' OR type LIKE 'ground')
           
           ")

# data frame with flying types and their speed stats
flyingSpeeds = flynGrndSPD_df[which(flynGrndSPD_df$type == "flying"), ]

# data frame with ground types and their speed stats
groundSpeeds = flynGrndSPD_df[which(flynGrndSPD_df$type == "ground"), ]

flynGrndSPD_df
##                    name   type stat_name base_stat
## 1             charizard flying     speed       100
## 2            butterfree flying     speed        70
## 3                pidgey flying     speed        56
## 4             pidgeotto flying     speed        71
## 5               pidgeot flying     speed       101
## 6               spearow flying     speed        70
## 7                fearow flying     speed       100
## 8             sandshrew ground     speed        40
## 9             sandslash ground     speed        65
## 10            nidoqueen ground     speed        76
## 11             nidoking ground     speed        85
## 12                zubat flying     speed        55
## 13               golbat flying     speed        90
## 14              diglett ground     speed        95
## 15              dugtrio ground     speed       120
## 16              geodude ground     speed        20
## 17             graveler ground     speed        35
## 18                golem ground     speed        45
## 19            farfetchd flying     speed        60
## 20                doduo flying     speed        75
## 21               dodrio flying     speed       100
## 22                 onix ground     speed        70
## 23               cubone ground     speed        35
## 24              marowak ground     speed        45
## 25              rhyhorn ground     speed        25
## 26               rhydon ground     speed        40
## 27              scyther flying     speed       105
## 28             gyarados flying     speed        81
## 29           aerodactyl flying     speed       130
## 30             articuno flying     speed        85
## 31               zapdos flying     speed       100
## 32              moltres flying     speed        90
## 33            dragonite flying     speed        80
## 34             hoothoot flying     speed        50
## 35              noctowl flying     speed        70
## 36               ledyba flying     speed        55
## 37               ledian flying     speed        85
## 38               crobat flying     speed       130
## 39              togetic flying     speed        40
## 40                 natu flying     speed        70
## 41                 xatu flying     speed        95
## 42               hoppip flying     speed        50
## 43             skiploom flying     speed        80
## 44             jumpluff flying     speed       110
## 45                yanma flying     speed        95
## 46               wooper ground     speed        15
## 47             quagsire ground     speed        35
## 48              murkrow flying     speed        91
## 49               gligar ground     speed        85
## 50               gligar flying     speed        85
## 51              steelix ground     speed        30
## 52               swinub ground     speed        50
## 53            piloswine ground     speed        50
## 54             delibird flying     speed        75
## 55              mantine flying     speed        70
## 56             skarmory flying     speed        70
## 57               phanpy ground     speed        40
## 58              donphan ground     speed        50
## 59             larvitar ground     speed        41
## 60              pupitar ground     speed        51
## 61                lugia flying     speed       110
## 62                ho-oh flying     speed        90
## 63            marshtomp ground     speed        50
## 64             swampert ground     speed        60
## 65            beautifly flying     speed        65
## 66              taillow flying     speed        85
## 67              swellow flying     speed       125
## 68              wingull flying     speed        85
## 69             pelipper flying     speed        65
## 70           masquerain flying     speed        60
## 71              nincada ground     speed        40
## 72              ninjask flying     speed       160
## 73                numel ground     speed        35
## 74             camerupt ground     speed        40
## 75             trapinch ground     speed        10
## 76              vibrava ground     speed        70
## 77               flygon ground     speed       100
## 78               swablu flying     speed        50
## 79              altaria flying     speed        80
## 80             barboach ground     speed        60
## 81             whiscash ground     speed        60
## 82               baltoy ground     speed        55
## 83              claydol ground     speed        75
## 84              tropius flying     speed        51
## 85            salamence flying     speed       100
## 86              groudon ground     speed        90
## 87             rayquaza flying     speed        95
## 88             torterra ground     speed        56
## 89               starly flying     speed        60
## 90             staravia flying     speed        80
## 91            staraptor flying     speed       100
## 92               mothim flying     speed        66
## 93               combee flying     speed        70
## 94            vespiquen flying     speed        40
## 95            gastrodon ground     speed        39
## 96             drifloon flying     speed        70
## 97             drifblim flying     speed        80
## 98            honchkrow flying     speed        71
## 99               chatot flying     speed        91
## 100               gible ground     speed        42
## 101              gabite ground     speed        82
## 102            garchomp ground     speed       102
## 103          hippopotas ground     speed        32
## 104           hippowdon ground     speed        47
## 105             mantyke flying     speed        50
## 106           rhyperior ground     speed        40
## 107            togekiss flying     speed        80
## 108             yanmega flying     speed        95
## 109             gliscor ground     speed        95
## 110             gliscor flying     speed        95
## 111           mamoswine ground     speed        80
## 112              pidove flying     speed        43
## 113           tranquill flying     speed        65
## 114            unfezant flying     speed        93
## 115              woobat flying     speed        72
## 116             swoobat flying     speed       114
## 117             drilbur ground     speed        68
## 118           excadrill ground     speed        88
## 119           palpitoad ground     speed        69
## 120          seismitoad ground     speed        74
## 121             sandile ground     speed        65
## 122            krokorok ground     speed        74
## 123          krookodile ground     speed        92
## 124            sigilyph flying     speed        97
## 125              archen flying     speed        70
## 126            archeops flying     speed       110
## 127            ducklett flying     speed        55
## 128              swanna flying     speed        98
## 129              emolga flying     speed       103
## 130            stunfisk ground     speed        32
## 131              golett ground     speed        35
## 132              golurk ground     speed        55
## 133             rufflet flying     speed        60
## 134            braviary flying     speed        80
## 135             vullaby flying     speed        60
## 136           mandibuzz flying     speed        80
## 137  tornadus-incarnate flying     speed       111
## 138 thundurus-incarnate flying     speed       111
## 139  landorus-incarnate ground     speed       101
## 140  landorus-incarnate flying     speed       101
## 141           diggersby ground     speed        78
## 142          fletchling flying     speed        62
## 143         fletchinder flying     speed        84
## 144          talonflame flying     speed       126
## 145            vivillon flying     speed        89
## 146            hawlucha flying     speed       118
## 147              noibat flying     speed        55
## 148             noivern flying     speed       123
## 149             yveltal flying     speed        99
## 150             zygarde ground     speed        95
## 151      wormadam-sandy ground     speed        36
## 152         shaymin-sky flying     speed       127
## 153           rotom-fan flying     speed        86
## 154    tornadus-therian flying     speed       121
## 155   thundurus-therian flying     speed       101
## 156    landorus-therian ground     speed        91
## 157    landorus-therian flying     speed        91
## 158    charizard-mega-y flying     speed       100
## 159         pinsir-mega flying     speed       105
## 160     aerodactyl-mega flying     speed       150
## 161       garchomp-mega ground     speed        92
## 162       swampert-mega ground     speed        70
## 163        steelix-mega ground     speed        30
## 164        pidgeot-mega flying     speed       121
## 165      groudon-primal ground     speed        90
## 166       rayquaza-mega flying     speed       115
## 167       camerupt-mega ground     speed        20
## 168      salamence-mega flying     speed       120

In order to figure out what hypothesis test to choose, lets figure out if our data follows a normal distribution.

hist(flyingSpeeds$base_stat, main = "Flying Type Speed Distribution", xlab = "Speed stat")

hist(groundSpeeds$base_stat, main = "Ground Type Speed Distribution", xlab = "Speed stat")

qqnorm(flyingSpeeds$base_stat, main  = "Normal Q-Q Plot for Flying Types")
qqline(flyingSpeeds$base_stat, col = "red")

qqnorm(groundSpeeds$base_stat, main  = "Normal Q-Q Plot for Ground Types")
qqline(groundSpeeds$base_stat, col = "red")

While the data doesnt appear to be normal, it helps to do a formal test to make sure that we are making an accurate interpretation of our data that isnt solely reliant on visuals.
In order to do that, we’ll conduct a Shapiro Wilk Test on a random sample from both groups.

First we need to take random samples from both groups with sample size n = 30.

flySpeedSample = sample(groundSpeeds$base_stat, 30, replace = TRUE)
grndSpeedSample = sample(flyingSpeeds$base_stat, 30, replace = TRUE)

Then we can use R’s built in function for the Shapiro test for normality.

shapiro.test(flySpeedSample)
## 
##  Shapiro-Wilk normality test
## 
## data:  flySpeedSample
## W = 0.90575, p-value = 0.01164
shapiro.test(grndSpeedSample)
## 
##  Shapiro-Wilk normality test
## 
## data:  grndSpeedSample
## W = 0.97671, p-value = 0.733

Lets interpret these results: For info, the Shapiro Wilk test is a hypothesis test where the null hypothesis is that a random sample comes from a normally distributed dataset.

The W statistic is the result of the Shapiro Wilk Test, which measures how close the sample matches a normal distribution, with 1 being normal and 0 being not normal. The P value determines if we can reject the null hypothesis that the sample comes from a normally distributed dataset. If the p value result is <= 0.05, we can reject the null hypothesis, however if the p value result is > 0.05 we are unable to reject the null hypothesis that the random sample comes from a normally distributed dataset.

However, we encounter a problem. Depending on the random sample, the samples either fail or pass the test. We want to see if it is passing this test a reliable number of times. So lets build a function that does the shapiro wilk test on the 100 different samples, and if 95% of them are passing, we can reasonable say that they pass the shapiro wilk test.

shapiroSample = function(dataset, reps, sampleSize){
  

  pass_vec = c()
  for(i in 1:reps){
    sample = sample(dataset, sampleSize, replace = TRUE)
    outcome = shapiro.test(sample)
    
    # checks to see if this sample rejects the null hypothesis or not
    # if it does adds a 0 for fail, and if not adds a 1 for pass
    
    if(outcome[2] > 0.05){pass_vec[i] = 1}
    else if(outcome[2] <= 0.05){pass_vec[i] = 0}
    
  }
  
  return(pass_vec)
}

Now lets execute this function for our data.

mean(shapiroSample(flyingSpeeds$base_stat, 100, 30))
## [1] 0.86
mean(shapiroSample(groundSpeeds$base_stat, 100, 30))
## [1] 0.69

Since the percentage of samples out of 100 that pass the Shapiro Wilk test are nowhere close to 95%, or even 90%, for both flying and ground types, we can say that our data is not normally distributed.

So what do we do now?
We need to find an appropriate statistical test that will give us the findings we are looking for without normality of the data being a requirement.

dbDisconnect(pokemonDatabase)